Abstract

Respondent-driven sampling is a widely-used network sampling technique, designed
to sample from hard-to-reach populations. Estimation from the resulting samples is an
area of active research, with software available to compute at least four estimators of
a population proportion. Each estimator is claimed to address deficiencies in previous
estimators; however, those claims are often unsubstantiated. In this study we provide a
simulation-based comparison of five existing estimators, focussing on sampling conditions
which a recent estimator is designed to address. We find no estimator consistently
outperforms all others, and highlight sampling conditions in which each is to be preferred.

Respondent-driven sampling (RDS) (Heckathorn, 1997) is currently a widely used method for
sampling from hidden populations. A hidden population is a population for which it is difficult
or impossible to compile a sampling frame, and hence traditional sampling and estimation
methods cannot be used. However, it is often still important to collect information on these
populations, for example to estimate the prevalence of HIV among injecting drug users in a
city. In this paper, we compare five estimators based on RDS data, under sampling conditions
the estimator presented in Heckathorn (2007) is designed to address. In particular, we focus
on the problem of estimating a population proportion in a hidden population with two groups,
which we refer to as infected and uninfected.

The basic method used to select a respondent-driven sample is as follows: First, a small
number of population members are selected as seeds, typically from among individuals
known to the researcher. Each seed is given a number of coupons, each of which has a unique
bar-code, and is asked to pass on the coupons to other people they know within the population.
When an individual has received a coupon, they are asked to report to a study centre where 
information of interest is collected by the researcher (such information is also collected from 
the seeds). A small monetary reward is often offered at this stage to encourage response. 
The responders are then themselves given coupons, and are asked to hand them on to others 
they know within the population, usually only to those who have not yet been recruited. In 
this manner, after the initial selection of seeds, the sampling is driven by the respondents. 
Those who report to the study centre are known to those who have already been selected, and 
recruitee-recruiter relationships can be determined from the bar-codes of the coupons. The 
information available on which to base an estimate is therefore information collected from the 
respondents and from the recruitment patterns. Respondents are usually asked how many 
people they know within the population of interest. This provides an estimate of degree, as 
described later. 

The original RDS paper (Heckathorn, 1997) suggested using the sample proportion as an
estimator of population proportion (we refer to this as the "Naive" estimator), and showed
that this estimator is unbiased under very strong assumptions about the sampling process.
Subsequent papers (Salganik and Heckathorn, 2004; Volz and Heckathorn, 2008; Heckathorn,
2007; Gile, 2010) have relaxed some of these assumptions and proposed several alternative
estimators. We will refer to these estimators as the Salganik-Heckathorn (SH) estimator
(Salganik and Heckathorn, 2004), the Volz-Heckathorn (VH) estimator (Volz and Heckathorn,
2008), the Heckathorn (H) estimator (Heckathorn, 2007), and the Successive Sampling (SS)
estimator (Gile, 2010). The SH-estimator has been implemented in the freely-available
RDSAT software (Volz et al., 2007) for several years, and is widely used by researchers
implementing RDS. The H-estimator has been implemented in the latest version of this
software, RDSAT 6.0 Beta. There is an urgent need to better understand the properties of the
Heckathorn estimator in practice, and in comparison to the VH- and SH-estimators. The
VH-, SS- and SH-estimators are implemented in the RDS R package (Handcock et al., 2009).
Although the estimators are easy to implement, the nature of their behaviour is not well
understood. The main reason for this is that several of the assumptions which underpin the
theoretical frameworks in which the estimators are derived and analysed are not met in
practice. For example, it is generally assumed that sampling is with replacement or that seeds
are selected randomly, whereas in practice these conditions almost never hold. Gile and
Handcock (2010) conducted a simulation study of the VH- and SH-estimators under realistic
sampling scenarios which violated these assumptions. Since this work was done the Heckathorn
estimator has also been added to the RDSAT software and is already being used in practice.
This estimator is designed to address violations of sampling assumptions that were not studied
by Gile and Handcock (2010).

In particular, all previous estimators assume that sampled individuals always respond,
that they always recruit other individuals when requested, and that they recruit uniformly at
random from their acquaintances in the population. In practice, sampled individuals may not
respond (non-response), may not always recruit others (imperfect recruitment effectiveness)
and may preferentially recruit neighbours with particular characteristics (differential
recruitment). In this paper we investigate the effect these three deviations from the sampling
assumptions have on the behaviour of the estimators. Thus, our comparison study is
well-positioned to investigate the contributions of the H-estimator, which claims to adjust for
differential recruitment and recruitment effectiveness (Heckathorn, 2007).

In the remainder of this paper we first present the estimators we will compare, with a
particular focus on the SH- and H-estimators. We then investigate in turn, via a simulation
study, the effect of differential recruitment, imperfect recruitment effectiveness and
non-response on the bias and variance of the estimators. In all these simulations the SH- and
H-estimates were very similar to each other, so we extend the simulation study further to
explore the conditions under which these estimates are likely to differ. Finally, we summarise
the results and make some concluding remarks.



2 Estimators of a Population Proportion 

In this section we briefly present the form of the five RDS estimators we consider in the
simulation study. We use $\pi_i$ to represent the true, unknown probability of selection for
individual $i$. The set of individuals selected in the sample is denoted by $s$, and $s_g$ denotes the
intersection of $s$ and the group $g$. We assume that the response variable is binary, and refer
to the levels as infected and uninfected. $A$ is used to denote the set of all infected individuals,
and $B$ to denote the set of uninfected individuals. The population quantity of interest is the
proportion of infected individuals, denoted $P_A$. Following standard sampling notation, we
denote population totals by upper-case letters and sample quantities by lower-case letters.
Hence $\Pi_g = \sum_{i \in g} \pi_i$, for example. The population size is denoted by $N$, and the sample
size by $n$. We use $d_i$ to denote the degree of individual $i$. The degree of an individual is
considered to be the number of other individuals in the population to which that individual
could potentially pass a coupon.



Most of the estimators described in this paper are functions of Hansen-Hurwitz (Hansen and
Hurwitz, 1943) estimators, or, in the case of the SS estimator, of the closely related
Horvitz-Thompson estimators. The Hansen-Hurwitz estimator of a population total $Y$ is
unbiased and given by (Hansen and Hurwitz, 1943)

$$\hat{Y} = \sum_{i \in s} \pi_i^{-1} y_i. \qquad (1)$$

The generalised Hansen-Hurwitz estimator of a population mean, Y /N, is not unbiased but is 
asymptotically unbiased. We refer to any ratio of Hansen-Hurwitz estimators as a generalised 
Hansen-Hurwitz estimator. 
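
The following minimal Python sketch illustrates the Hansen-Hurwitz total and a generalised (ratio) Hansen-Hurwitz estimator of a mean. It assumes that the weights supplied play the role of the selection probabilities in the estimator of the total; the function names and the toy data are hypothetical and are not taken from any RDS software.

```python
def hansen_hurwitz_total(y, pi):
    """Hansen-Hurwitz-type estimate of a total: sum of y_i / pi_i over the sample."""
    if len(y) != len(pi):
        raise ValueError("y and pi must have the same length")
    return sum(yi / p for yi, p in zip(y, pi))


def generalised_hh_mean(y, pi):
    """Ratio of two Hansen-Hurwitz estimates: estimated total of y over estimated N."""
    total_hat = hansen_hurwitz_total(y, pi)
    # Setting y_i = 1 for every sampled unit estimates the population size N.
    size_hat = hansen_hurwitz_total([1.0] * len(y), pi)
    return total_hat / size_hat


# Toy sample (illustrative values only): binary outcomes and assumed weights.
y = [1, 0, 1]
pi = [0.5, 0.25, 0.25]
total_hat = hansen_hurwitz_total(y, pi)
mean_hat = generalised_hh_mean(y, pi)
```

Because the estimated population size appears in the denominator, the ratio is unchanged if all weights are multiplied by a common constant, which is why the estimators below only need selection probabilities up to proportionality.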

2.1 Naive Estimator

The Naive estimate (Heckathorn, 1997) is equal to the sample proportion of infected
individuals, i.e.

$$\hat{P}_A^{N} = \frac{n_A}{n}. \qquad (2)$$

If the true sampling probabilities are given by $\pi_i$, $i = 1, \ldots, N$, then $\hat{P}_A^{N}$ is a generalised
Hansen-Hurwitz estimator of

$$\frac{\Pi_A}{\Pi_A + \Pi_B} \qquad (3)$$

(an explanation is given in the appendix).

Equation (3) shows that if the sampling probabilities of all individuals are equal, then $\hat{P}_A^{N}$ is
a generalised Hansen-Hurwitz estimator of $N_A/N$. Thus, under very restrictive assumptions,
$\hat{P}_A^{N}$ is asymptotically unbiased for $P_A$.¹ If the true probabilities of selection for infected
individuals are greater than those of uninfected individuals, then it can be seen from (3) that
$\hat{P}_A^{N}$ will be biased upwards.

¹We use the term asymptotically unbiased somewhat reluctantly, as the asymptotics require infinite
population size, which is often a poor approximation to the scenarios where RDS is used. However, since
asymptotic arguments are used to justify the use of these estimators, we retain them here, noting that it is
precisely the failure of assumptions necessary for these asymptotics that necessitates studies such as this one.

2.2 Volz-Heckathorn Estimator

The Volz-Heckathorn (VH-) estimator (Volz and Heckathorn, 2008) is given by

$$\hat{P}_A^{VH} = \frac{\sum_{i \in s_A} d_i^{-1}}{\sum_{i \in s} d_i^{-1}}, \qquad (4)$$

where $d_i$ denotes the reported degree of individual $i$. Using the same approach as in Section
2.1, if the true selection probability of individual $i$ is $\pi_i$, $i = 1, \ldots, N$, it can be shown that
$\hat{P}_A^{VH}$ is a generalised H-H estimator of

$$\frac{\sum_{i \in A} \pi_i / d_i}{\sum_{i \in A} \pi_i / d_i + \sum_{i \in B} \pi_i / d_i}. \qquad (5)$$

The VH-estimator is asymptotically unbiased for $P_A$ if $\pi_i \propto d_i$ for all $i$. Similarly to the case
for $\hat{P}_A^{N}$, it can be seen from expression (5) that if the true probability of selection for infected
individuals is larger than assumed, then $\hat{P}_A^{VH}$ is likely to be biased upwards.
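
To make the difference between the Naive and VH- estimators concrete, here is a small Python sketch. It assumes the VH- estimate is the inverse-degree-weighted sample proportion described above; the function names and the four-person sample are hypothetical.

```python
def naive_estimate(infected):
    """Sample proportion of infected respondents (infected is a list of 0/1 flags)."""
    return sum(infected) / len(infected)


def vh_estimate(infected, degrees):
    """Inverse-degree-weighted proportion: weights 1/d_i stand in for 1/pi_i."""
    inverse_degree = [1.0 / d for d in degrees]
    numerator = sum(w for w, flag in zip(inverse_degree, infected) if flag)
    return numerator / sum(inverse_degree)


# Illustrative sample: infected respondents report twice the degree of uninfected ones.
infected = [1, 1, 0, 0]
degrees = [10, 10, 5, 5]
p_naive = naive_estimate(infected)       # 0.5
p_vh = vh_estimate(infected, degrees)    # one third
```

In this toy sample the infected respondents have higher degree, so the VH- estimate down-weights them and falls below the sample proportion. This is the sense in which the VH- estimator adjusts for differential activity when selection probabilities really are proportional to degree.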



2.3 Successive Sampling Estimator 

The Successive Sampling (SS-) estimator (Gile, 2010) is very similar in form to the
VH-estimator. The difference is in the estimation of the sampling weights. Rather than assuming
sampling probabilities proportional to degree as in $\hat{P}_A^{VH}$, this estimator substitutes a function
of the degree, based on approximating the sampling process as a successive sampling process.
This process is without replacement, whereas the Naive and VH-estimators are derived
based on the assumption of with-replacement sampling (among others). Therefore, the key
contribution of the SS-estimator over the other estimators is the relaxing of the assumption
of with-replacement sampling. Its form is given by:

$$\hat{P}_A^{SS} = \frac{\sum_{i \in s_A} \hat{\pi}_i^{-1}}{\sum_{i \in s} \hat{\pi}_i^{-1}}, \qquad (6)$$

where $\hat{\pi}_i$ denotes the estimated sampling probability of individual $i$, estimated according to
the successive sampling procedure described in Gile (2010). Again, using the approach in
Section 2.1, if the true selection probability of individual $i$ is $\pi_i$, $i = 1, \ldots, N$, it can be shown
that $\hat{P}_A^{SS}$ is a generalised H-T estimator of

$$\frac{\sum_{i \in A} \pi_i / \hat{\pi}_i}{\sum_{i \in A} \pi_i / \hat{\pi}_i + \sum_{i \in B} \pi_i / \hat{\pi}_i}. \qquad (7)$$

Note that here the Horvitz-Thompson estimator (Horvitz and Thompson, 1952) is used in
place of the Hansen-Hurwitz estimator because of the without-replacement sampling
assumption.² Similarly to the case for $\hat{P}_A^{N}$ and $\hat{P}_A^{VH}$, it can be seen from (7) that if the true
probability of selection for infected individuals increases relative to uninfected individuals,
$\hat{P}_A^{SS}$ will tend to increase.

This estimator also requires knowledge of the population size, $N$. The sensitivity of the
SS-estimator to the value of $N$ is considered in Gile (2010). For all simulations in this paper
we assume that the true value, $N = 1000$, is known, and compute estimates using the R
package RDS (Handcock et al., 2009).

²Although this estimator has the same mathematical form, the difference is that the sampling probabilities
used are without-replacement sampling probabilities, as opposed to the draw-wise selection probabilities used
in the Hansen-Hurwitz estimator.

2.4 Salganik-Heckathorn Estimator

The Salganik-Heckathorn (SH-) estimator makes use of the fact that it is possible to measure
the proportion of within-group and cross-group recruitments in the sample by keeping track
of the barcodes of coupons distributed and returned.
The Salganik-Heckathorn estimator is given by

$$\hat{P}_A^{SH} = \frac{\hat{D}_B \hat{C}_{BA}}{\hat{D}_A \hat{C}_{AB} + \hat{D}_B \hat{C}_{BA}}. \qquad (8)$$

Here $\hat{C}_{AB}$ denotes the proportion of all individuals recruited by members of group $A$ who are
members of group $B$, and can be thought of as an estimate of the probability that a randomly
selected recruit in group $A$ recruits from group $B$. $\hat{D}_A$ is an estimate of mean degree given
by

$$\hat{D}_A = \frac{n_A}{\sum_{i \in s_A} d_i^{-1}}, \qquad (9)$$

and is a H-H estimator of $\sum_{i \in A} \pi_i \big/ \sum_{i \in A} \pi_i / d_i$. Note that as the probability of selection of
infected individuals increases relative to uninfected individuals, the ratio $\hat{C}_{AB}/\hat{C}_{BA}$ will tend to
decrease and hence, from (8), $\hat{P}_A^{SH}$ will tend to increase.

Because of its more complicated form, and the stronger assumptions required to derive it,
the SH-estimator is relatively difficult to study analytically compared to the Naive or
VH-estimators.
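
The ingredients of the SH- estimator can be sketched in a few lines of Python. The sketch assumes the usual form of the estimator, $\hat{D}_B \hat{C}_{BA} / (\hat{D}_A \hat{C}_{AB} + \hat{D}_B \hat{C}_{BA})$, with mean degree estimated by the harmonic-mean-type estimate $n_g / \sum d_i^{-1}$; the function names, argument layout and counts are hypothetical.

```python
def mean_degree_estimate(degrees):
    """Estimated mean degree of a group: n_g divided by the sum of inverse degrees."""
    return len(degrees) / sum(1.0 / d for d in degrees)


def sh_estimate(degrees_a, degrees_b, n_aa, n_ab, n_ba, n_bb):
    """SH-type estimate of the proportion in group A.

    n_xy is the number of recruitments made by members of group x
    whose recruit belongs to group y (read from the coupon records).
    """
    c_ab = n_ab / (n_aa + n_ab)   # share of A's recruits who are in B
    c_ba = n_ba / (n_ba + n_bb)   # share of B's recruits who are in A
    d_a = mean_degree_estimate(degrees_a)
    d_b = mean_degree_estimate(degrees_b)
    return d_b * c_ba / (d_a * c_ab + d_b * c_ba)


# Illustrative counts: equal mean degree, cross-group recruitment rates 0.4 and 0.1.
p_sh = sh_estimate([4, 4], [4, 4, 4, 4], n_aa=6, n_ab=4, n_ba=2, n_bb=18)
```

If no recruitments are recorded for one of the groups, one of the two proportions is undefined and the function raises `ZeroDivisionError`, which mirrors the situation in which an estimate cannot be computed.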

2.5 Heckathorn Estimator 

This estimator is an extension of the SH-estimator, and was motivated by the need to control
for biases introduced by differential recruitment and recruitment effectiveness (Heckathorn,
2007). Before introducing the form of the estimator, we need to consider the model used
in its derivation: that of a Markov chain on degree groups. Degree groups are formed by
partitioning the reported degrees into groups of contiguous degrees. Then assuming there is 
only one coupon, perfect recruitment effectiveness and no non-response, the sampling process 
is treated as a Markov chain. Here, time is indexed by the wave of the sample, and the state 
of the chain in a given wave is the degree group of the sampled individual. The transition 
probability from group g to g' is the probability that a coupon will be passed from an individual 
in group g to someone in group g' . Quite strong assumptions about respondent behaviour are 
required for this model to be appropriate, notably assumptions which are stronger than those 
required for the sampling process to be modelled as a Markov chain on nodes. 

The H-estimator is constructed by using an "adjusted" degree estimate $\widehat{AD}_A$ in place of
the estimates of mean degree in (8). The adjusted degree estimate is defined as

$$\widehat{AD}_A = \frac{\sum_{i \in s_A} RCD_i}{\sum_{i \in s_A} RCD_i / d_i}. \qquad (10)$$

$RCD_i$ is called the "recruitment component of degree" for individual $i$ and is formed based
on degree groups defined by the reported degree of the respondents. $RCD_i$ is defined as
the ratio of the estimated degree group equilibrium probability and the degree group sample
proportion. That is,

$$RCD_i = \frac{\hat{E}_g}{n_g / n}, \quad \text{for } i \text{ in degree group } g, \qquad (11)$$

where $n_g$ is the number of respondents in degree group $g$ and $\hat{E}_g$ is the estimated equilibrium
probability of being in degree group $g$. For all degree groups $g$, $g'$, the probability of transition
for the Markov chain from $g$ to $g'$ is estimated by the proportion of recruitments from $g$ to $g'$,
$\hat{C}_{gg'}$. The estimated equilibrium probability of being in group $g$, $\hat{E}_g$, is then calculated from
the matrix of estimated transition probabilities (Heckathorn, 2007, pg 171).

The Heckathorn estimator uses the adjusted degree estimates in place of the unadjusted
degree estimates in the SH-estimator, which gives

$$\hat{P}_A^{H} = \frac{\widehat{AD}_B \hat{C}_{BA}}{\widehat{AD}_A \hat{C}_{AB} + \widehat{AD}_B \hat{C}_{BA}}. \qquad (12)$$

If $RCD_i = 1$ for all $i$, then it can be seen from (10) that $\widehat{AD}_g = \hat{D}_g$ for all $g$, in which case
$\hat{P}_A^{H} = \hat{P}_A^{SH}$.
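
The steps leading to the adjusted degree estimate can be sketched in Python as follows. This is an illustration under stated assumptions rather than the RDSAT implementation: the equilibrium distribution is approximated by power iteration (the source only says it is calculated from the estimated transition matrix), and the adjusted degree is assumed to be the RCD-weighted analogue of the mean degree estimate, $\sum RCD_i / \sum (RCD_i / d_i)$, which reduces to the unadjusted estimate when every $RCD_i = 1$. All function names and data are hypothetical.

```python
def equilibrium(transition, iterations=1000):
    """Approximate the stationary distribution of a row-stochastic matrix.

    Uses power iteration from the uniform distribution, so it assumes the
    chain is irreducible and aperiodic.
    """
    k = len(transition)
    dist = [1.0 / k] * k
    for _ in range(iterations):
        dist = [sum(dist[g] * transition[g][h] for g in range(k)) for h in range(k)]
    return dist


def rcd_weights(groups, equilibrium_probs):
    """RCD_i = estimated equilibrium probability / sample proportion of i's degree group."""
    n = len(groups)
    counts = {}
    for g in groups:
        counts[g] = counts.get(g, 0) + 1
    return [equilibrium_probs[g] / (counts[g] / n) for g in groups]


def adjusted_degree(degrees, rcd):
    """Assumed form of the adjusted degree: sum(RCD_i) / sum(RCD_i / d_i)."""
    return sum(rcd) / sum(r / d for r, d in zip(rcd, degrees))


# Illustrative two-group example: group 0 = low degree, group 1 = high degree.
transition = [[0.5, 0.5], [0.25, 0.75]]   # estimated recruitment proportions
groups = [0, 0, 1, 1]                      # degree group of each respondent
degrees = [2, 2, 4, 4]
e_hat = equilibrium(transition)            # approximately (1/3, 2/3)
rcd = rcd_weights(groups, e_hat)
ad_hat = adjusted_degree(degrees, rcd)
d_hat = adjusted_degree(degrees, [1.0] * len(degrees))   # unadjusted estimate
```

Here the high-degree group is under-represented in the sample relative to the estimated equilibrium, so its members receive weights above one and the adjusted degree exceeds the unadjusted estimate.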

To define the degree groups we use the recommended method of Heckathorn (2007), to
mimic as closely as possible what we believe to be the default method implemented in RDSAT.
In this method the user specifies a "mean cell size" $n_c$, which determines the "aggregation
level" $AL = \sqrt{n/n_c}$. The range of degrees is then split into $AL$ groups of approximately equal
size. The default value is $n_c = 12$, which is the value we use for all simulations carried out in
later sections.
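
A rough Python sketch of this grouping step is given below. It assumes the aggregation level is the square root of the sample size divided by the mean cell size, and it uses a simple rank-based rule to form contiguous degree groups of roughly equal size. How RDSAT rounds the aggregation level and places group boundaries is not stated here, so both the rounding and the splitting rule are assumptions, and the function names are hypothetical.

```python
import math


def aggregation_level(n, mean_cell_size=12):
    """Assumed aggregation level: sqrt(sample size / mean cell size)."""
    return math.sqrt(n / mean_cell_size)


def degree_groups(degrees, k):
    """Assign each respondent to one of at most k contiguous degree groups.

    Respondents are ranked by degree; a degree value is placed according to
    the rank of its first occurrence, so equal degrees always share a group.
    With many ties some group labels may go unused.
    """
    ordered = sorted(degrees)
    n = len(ordered)
    first_rank = {}
    for rank, d in enumerate(ordered):
        first_rank.setdefault(d, rank)
    return [first_rank[d] * k // n for d in degrees]


al_200 = aggregation_level(200)          # about 4.08 for a sample of 200
k = max(1, round(aggregation_level(192)))  # rounding rule is an assumption
example_groups = degree_groups([1, 2, 3, 4, 5, 6, 7, 8], k)
```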

It should be noted that if all recruits from group $g$ only recruit others from group $g$, the
estimated equilibrium probability $\hat{E}_g$ will be 1. This means that $\widehat{AD}_A$ will be based only on
individuals within the "absorbing" degree group, which leads to instability. However, this
case did not arise for any of the simulations presented in this paper.



Comments on the Heckathorn Estimator 

$\widehat{AD}_A$ can be seen as some kind of adjustment for divergence from equilibrium of the Markov
chain on degree groups. Suppose all the assumptions that are required for the sampling
process to be described as a Markov chain on degree groups hold. Then if the Markov chain
started in equilibrium, $n_g/n$ is an unbiased estimator of $E_g$ for all degree groups $g$. Therefore,
from (11) it can be seen that it is unlikely for the H- and SH-estimates to differ unless the
Markov chain on degree groups does not start in equilibrium. This is likely to occur if seeds 
are chosen disproportionately from individuals of low or high degree, for example, or if only 
infected individuals are chosen as seeds and the degree distribution of infected and uninfected 
individuals differ. Further, we would expect to see larger differences between the H- and SH- 
estimates if the chain is out-of-equilibrium for a larger proportion of the sample, for example 
for smaller sample fractions or if convergence is slowed by homophily on degree-group. 

The above reasoning is based on the assumption that the sampling process is a Markov 
chain on degree groups. This is extremely unlikely to ever occur in practice, but may be a 
good enough model to inform our investigation. We comment further on this in Section 6.

2.6 Summary

All the estimators considered substitute some estimate for the true sampling probabilities into
a H-H or H-T estimator: the Naive estimator assumes all $\pi_i$ are equal, the VH- and
SH-estimators assume $\pi_i \propto d_i$, and the SS-estimator uses a successive sampling estimate of the
$\pi_i$. Although all the estimators considered in this paper can be shown to be asymptotically
unbiased under certain conditions (see Neely (2009) and Gile (2010) for details), in practice
these conditions will not hold. Of interest is how the estimators compare when the assumptions
are violated. Differential recruitment, imperfect recruitment effectiveness and non-response
are all sampling behaviours which violate the assumptions under which the estimators were
derived. How much these types of sampling behaviour affect the estimates is likely to depend
on how great the resulting difference is between the true and estimated sampling probabilities
(Neely, 2009). We investigate this via a simulation study.



3 Design of the Study 

In this section we present the simulation design used for all simulations presented in this paper. 
Because it is not possible to consider all configurations of network and sampling parameters, 
we concentrate on parameterisations which are likely to approximate real-life conditions. In
particular, the level of homophily, mean degree and the number of seeds were chosen to match
the characteristics of the pilot data from the CDC surveillance program (Abdul-Quader et al.,
2006; Gile and Handcock, 2010).



The population is modelled as an undirected network with 1000 nodes, where each node
represents one individual in the population.³ An edge between two nodes $i$ and $i'$ means that
there is a non-zero probability that i will recruit i' when given a coupon, and vice versa. The 
set of potential recruits of an individual i is therefore given by the neighbours of node i, and 
the number of neighbours is equivalent to the degree of i, di. 

For all simulations 200 of the nodes are infected, so $P_A = 0.2$. The overall mean degree
of the network is equal to seven. There is a moderate amount of homophily in all the
networks. Specifically, we fix the relative probabilities of edges between two infected nodes and
between an infected and an uninfected node, such that the former is five times as likely. In
the case of 20% infected nodes and no differential activity (defined below), this implies the
probability of an uninfected-uninfected edge is twice the probability of an uninfected-infected
edge. Differential activity (DA) is defined as the ratio of the mean degree of infected nodes
to the mean degree of uninfected nodes, i.e. $D_A/D_B$. This parameter is varied throughout
the simulations and takes values in {0.5, 1, 1.8}. Parameter values given in bold face denote
the default values. Networks were simulated using the R package statnet (Handcock et al.,
2003, 2008).
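
The statement that uninfected-uninfected edges are about twice as likely as uninfected-infected edges can be checked with a few lines of arithmetic. The sketch below assumes a simple model in which each pair of nodes is joined independently with a probability that depends only on the infection status of the two nodes; this is a simplification used for illustration and is not necessarily the network model fitted with statnet. Variable names are hypothetical.

```python
n_infected = 200
n_uninfected = 800
mean_degree = 7.0          # same for both groups when differential activity is 1
homophily_ratio = 5.0      # infected-infected edge is five times as likely as infected-uninfected

# Expected degree of an infected node:
#   (n_infected - 1) * p_ii + n_uninfected * p_iu = mean_degree, with p_ii = 5 * p_iu.
p_iu = mean_degree / ((n_infected - 1) * homophily_ratio + n_uninfected)
p_ii = homophily_ratio * p_iu

# Expected degree of an uninfected node:
#   n_infected * p_iu + (n_uninfected - 1) * p_uu = mean_degree.
p_uu = (mean_degree - n_infected * p_iu) / (n_uninfected - 1)

uu_to_ui_ratio = p_uu / p_iu   # close to 2 under these assumptions
```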



Given a population connected by the network, we simulate a respondent-driven sampling 
process. In every simulation we draw ten seeds. Unless otherwise stated, seeds are selected 
at random with probability proportional to degree from all nodes. Sampling is done without 
replacement, and all respondents are given two coupons. We consider samples of size 200 
and 500, but unless there are qualitative differences we only present the simulation results 
for samples of size 200. The parameterisation of the sampling process relating to differential
recruitment, recruitment effectiveness and non-response will be discussed in more detail in
the relevant sections.

³From now on we refer to nodes and individuals interchangeably.
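
The baseline sampling process (ten seeds chosen with probability proportional to degree, two coupons per respondent, sampling without replacement, uniform choice among eligible neighbours) can be sketched as below. This is a minimal illustration, not the simulation code used for the study: the wave-by-wave ordering of recruitment, the sequential draw of seeds and the function name `simulate_rds` are all assumptions.

```python
import random


def simulate_rds(neighbours, n_seeds=10, n_coupons=2, target_size=200, rng=None):
    """Simulate one respondent-driven sample on a network.

    neighbours maps each node to a list of its neighbours.
    Returns a dict mapping each sampled node to its recruiter (None for seeds).
    """
    rng = rng or random.Random()
    pool = list(neighbours)
    weights = [len(neighbours[v]) for v in pool]

    # Draw seeds one at a time, proportional to degree, without replacement.
    seeds = []
    while len(seeds) < n_seeds and any(weights):
        i = rng.choices(range(len(pool)), weights=weights)[0]
        seeds.append(pool.pop(i))
        weights.pop(i)

    recruiter = {s: None for s in seeds}
    wave = seeds
    while wave and len(recruiter) < target_size:
        next_wave = []
        for v in wave:
            for _ in range(n_coupons):
                if len(recruiter) >= target_size:
                    break
                # Only neighbours not yet in the sample are eligible.
                eligible = [u for u in neighbours[v] if u not in recruiter]
                if not eligible:
                    break
                u = rng.choice(eligible)   # uniform recruitment, no differential recruitment
                recruiter[u] = v
                next_wave.append(u)
        wave = next_wave
    return recruiter


# Small illustrative network: a complete graph on 30 nodes.
toy_network = {i: [j for j in range(30) if j != i] for i in range(30)}
toy_sample = simulate_rds(toy_network, n_seeds=3, target_size=20, rng=random.Random(1))
```

Differential recruitment, imperfect recruitment effectiveness and non-response would each be introduced by changing the recruitment step inside the inner loop, for example by replacing the uniform choice with a weighted one.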

For any given value of the population and sampling parameters, the simulations are run 
as follows: 

1. Simulate 1000 networks. To allow for comparison, we used the same networks as were
used in Gile and Handcock (2010).

2. For each network, simulate one respondent-driven sample. We therefore have 1000
samples in total.

3. For each sample, calculate the estimate of $P_A$ from each of the estimators described in Section 2.

The mean of the 1000 estimates from any estimator is a measure of the expected value of 
that estimator under the given population and sampling parameters. The variance of the 
estimates is a measure of the sampling variation of that estimator in addition to how sensitive 
the estimator is to variations in population structure. 
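
The summary step for one estimator can be sketched as follows. The sketch assumes that estimates which could not be computed for a given sample are recorded as `None` and excluded from the mean and variance; that convention, the function name and the toy numbers are illustrative assumptions rather than a description of the code used for the study.

```python
from statistics import mean, variance


def summarise(estimates, truth):
    """Summarise the simulated estimates from one estimator.

    estimates: one estimate per simulated sample (None if it could not be computed).
    truth: the true population proportion used to generate the networks.
    """
    usable = [e for e in estimates if e is not None]
    centre = mean(usable)
    return {
        "mean": centre,                 # approximates the expected value of the estimator
        "bias": centre - truth,
        "variance": variance(usable),   # sample variance across simulated samples
        "n_failed": len(estimates) - len(usable),
    }


# Illustrative values only, not simulation output.
summary = summarise([0.18, 0.22, 0.20, None, 0.24], truth=0.2)
```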



4 The Effect of Differential Recruitment 

Whereas the sample design is determined by the survey organiser, differential recruitment is 
a property of respondent behaviour which is not possible for the survey organiser to observe 
directly. Therefore, it is important to consider how robust the estimators are to the presence 
of differential recruitment. 

In this section we show that the bias of all the estimators is significantly affected by
differential recruitment. The direction of this bias can be explained by considering how
differential recruitment affects the probabilities of selection $\pi_i$.

We first define the types of differential recruitment which we will consider in this paper. 

Definition 1 (Differential recruitment) Suppose the proportion of neighbours of node $i$
which belong to group $g$ is equal to $p_{gi}$, for all $g \in G$. Then differential recruitment exists if
and only if the probability that $i$ passes a given coupon to a neighbour from group $g$ is not equal to
$p_{gi}$ for some $g \in G$ and some $i$.

In other words, differential recruitment exists if a respondent is more (or less) likely to pass 
on a coupon to neighbours of some group than if he were choosing uniformly at random from 
his neighbours. We will consider two types of differential recruitment: 

1. Within-group differential recruitment, when nodes preferentially recruit neighbours from
within their own group, and

2. Between-group differential recruitment, when all nodes preferentially recruit nodes from
a particular group.

For each type of differential recruitment we distinguish between infection-group differential
recruitment and degree-group differential recruitment. It is useful to note that either type
of differential recruitment can induce the other if the degree distributions of infected and
uninfected nodes are not the same.

We first investigate the influence of within-group differential recruitment in some detail,
then consider between-group differential recruitment.

4.1 Within-group Differential Recruitment

We model within-group differential recruitment with the following assumption:

Assumption 1 Suppose that node $i$ is a member of group $g$, and is currently recruiting.
Then the probability that node $i$ recruits a neighbour in group $g$ is proportional to $K_g$, and the
probability that node $i$ recruits a neighbour from some other group $g' \neq g$ is proportional to 1.

This is similar to the two-group model of Goel and Salganik (2009), except that we allow the
strength of differential recruitment to vary by group and do not assume that every individual
has a non-zero probability of recruiting any other individual. Clearly if differential recruitment
is greater for individuals in group $g$ than for individuals in group $g'$, i.e. $K_g > K_{g'}$, then
the sampling probabilities of group-$g$ individuals will increase relative to those of group-$g'$
individuals. If differential recruitment exists but is similar for both groups, i.e. $K_g \approx K_{g'}$, then
the relative sampling probabilities of nodes in groups $g$ and $g'$ should not change dramatically.
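
Assumption 1 translates directly into recruitment probabilities over a node's neighbours. The sketch below is a minimal illustration: each same-group neighbour receives weight $K_g$ and every other neighbour receives weight 1, and the weights are normalised over the recruiter's neighbours. The function name and the example values are hypothetical.

```python
def recruitment_probabilities(own_group, neighbour_groups, k):
    """Probability of passing a coupon to each neighbour under Assumption 1.

    own_group: group label of the recruiting node.
    neighbour_groups: group label of each eligible neighbour.
    k: dict mapping each group label to its within-group preference K_g.
    """
    weights = [k[own_group] if g == own_group else 1.0 for g in neighbour_groups]
    total = sum(weights)
    return [w / total for w in weights]


# A group-A node with one A neighbour and two B neighbours, with K_A = 2.
probs = recruitment_probabilities("A", ["A", "B", "B"], {"A": 2.0, "B": 1.0})
# With K_A = 1 the choice is uniform over neighbours (no differential recruitment).
uniform = recruitment_probabilities("A", ["A", "B", "B"], {"A": 1.0, "B": 1.0})
```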

The effect of within-degree-group differential recruitment on the estimators of $P_A$ can
therefore be seen to depend on the distribution of degree among infection groups. If it is the
same, then within-degree-group differential recruitment will not induce within-infection-group
differential recruitment, and hence will not cause additional bias when estimating $P_A$. If the
distribution of degree is not the same among infection groups, for example when there ex- 
ists differential activity, then within-degree-group differential recruitment will induce within- 
infection-group differential recruitment. For example, suppose differential activity is equal 
to 1.8, so infected nodes have higher mean degree than uninfected nodes. If nodes of high- 
degree prefer to recruit other nodes of high-degree, this will induce differential recruitment by 
infection group, because infected nodes will preferentially recruit other high-degree infected 
nodes. 

Note that it is not always possible to distinguish from the sample if differential recruitment 
exists, because its effect on the resulting sampling chain is similar to that of homophily - the 
proportion of within-group edges. For example, suppose the proportion of recruitments from
group A to group B is $\hat{C}_{AB} = 0.5$. One possibility is that 50% of the neighbours of group A
nodes are in group B, and recruitments were made uniformly at random. Alternatively, it 
might be the case that 70% of the neighbours of group A nodes are in group B, but that 
group A nodes preferentially recruited other group A nodes. Thus, sampling probabilities for 
a given group will increase with either differential recruitment or homophily. 
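
The example can be made quantitative under Assumption 1. Suppose, as a simplification, that every group A node has the same share of group B neighbours. Then the expected proportion of A-to-B recruitments is share_B / (K_A × share_A + share_B), and this can be solved for the preference K_A that reproduces an observed proportion. The function name below is hypothetical.

```python
def implied_preference(share_b, c_ab):
    """Within-group preference K_A implied by an A-to-B recruitment proportion.

    share_b: proportion of a group A node's neighbours that are in group B.
    c_ab: observed proportion of group A's recruits that are in group B.
    Solves c_ab = share_b / (K_A * share_a + share_b) for K_A.
    """
    share_a = 1.0 - share_b
    return share_b * (1.0 - c_ab) / (c_ab * share_a)


k_uniform = implied_preference(0.5, 0.5)   # 50% B neighbours: K_A = 1, uniform recruitment
k_prefer = implied_preference(0.7, 0.5)    # 70% B neighbours: K_A = 7/3
```

Both scenarios give the same observed recruitment proportion of 0.5, so the recruitment data alone cannot separate a within-group preference of about 2.3 from no preference at all.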

Based on the properties of the estimators considered in Section 2, we can predict whether
the estimates will increase or decrease with increasing within-group differential recruitment.
We would expect that as $K_A$ increases relative to $K_B$, all of the estimates will also
increase.



Results 

Results of the simulations for three levels of differential activity are shown in Figure 1. Note
that the counts at the top of the boxplot denote the number of samples for which the estimate
was equal to one. Counts at the bottom of the boxplot denote the number of samples for
which an estimate could not be calculated.⁴

⁴The SH- and H-estimators will return a value of 1 if recruitments were made from infection group A to
B, but not from B to A. In every case that an estimate could not be computed, it was due to there being no
recruitments at all from at least one of the infection groups.

Figure 1: Simulation results for varying levels of within-infection-group differential
recruitment and differential activity. Differential recruitment is coded as $(K_B, K_A)$.

For all estimators and for all levels of differential activity, as $K_A$ increases relative to
$K_B$ the estimates increase, as expected. It can also be seen that the VH-, SH-, H- and
SS-estimates are significantly higher or lower than the sample proportion for differential activity
levels of 0.5 and 1.8 respectively. The SS- and VH-estimators are generally less biased than
the SH- and H-estimators, and the VH- and SS-estimators also have a lower variance than
the SH- and H-estimators. Other patterns, which are expected from previous studies, are that
the SS-, VH- and SH-estimators correct for differential activity (in contrast to the Naive
estimator) (Gile and Handcock, 2010; Gile, 2010), and that the SS-estimates lie between
those of the Naive and VH-estimators (Gile, 2010). However, there is no evidence that any
of the estimators are adjusting for differential recruitment, because the estimates increase
or decrease by approximately the same amount as the sample proportion as the differential
recruitment level is varied for a given level of differential activity.

There are no significant differences between the H- and SH-estimates for any of the
parameter combinations.⁵ We also ran simulations with $K_A = K_B \neq 1$ (results not shown), and
these verified that the estimates are very similar to the case that $K_A = K_B = 1$, as expected.



In order to parameterise within-degree-group differential recruitment for our simulations, 
we assumed that all nodes preferentially recruit other nodes of similar degree to themselves. 
In this way, no matter how the boundaries of the degree groups are defined by the practitioner, 
within-degree-group differential recruitment will exist. Figure 2 shows the probability that a
node of degree i will recruit a node of degree j, for all j and some i, for the simulations with 
within-degree-group differential recruitment. 

Figure 2: Relative probabilities that a node of degree d will recruit a node with degree shown
on the x-axis (degree of recruitee) for the simulations with within-degree-group differential
recruitment, for several values of d.

The simulation results are shown in Figure 3. From this figure it can be seen that,
for the parameterisation studied in this paper, the effect of within-degree-group differential
recruitment on the bias of the estimates is small. In the absence of differential activity,
within-degree-group differential recruitment has no significant effect on the bias of the estimators.
When there exists differential activity, then within-degree-group differential recruitment
reduces (DA = 0.5) or increases (DA = 1.8) the mean of each estimator. However, the effect
of changing levels of differential recruitment is quite small. This can be explained by the
fact that the induced levels of within-infection-group differential recruitment are likely to be
similar for both infected and uninfected nodes, i.e. $K_A \approx K_B$. Hence, from the results of
within-infection-group differential recruitment discussed above, the estimators will not show
substantial additional bias.

⁵All claims of statistical significance are based on paired t-tests of the null hypothesis of no difference in
mean at the 5% level of significance, with Bonferroni correction for multiple comparisons.

As in the case of within-infection-group differential recruitment, the variance of 
the SS-estimator is similar to that of the VH-estimator, and the variance of the SH- and 
H-estimators is slightly greater. There was only one significant difference between the means of 
the H- and SH-estimates: when there was both differential recruitment within degree group 
and differential activity equal to 1.8 (adjusted p-value 0.0018). In this case the mean estimate 
returned by the H- estimator was 3.7 x 10~^ greater than the mean for the SH- estimator. 
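
The significance claims in this section rest on paired comparisons of two estimators over the same simulated samples, with a Bonferroni adjustment. The Python sketch below shows one way such a comparison could be computed. It is not the authors' code: the function name, the data and the number of comparisons are hypothetical, and the p-value uses a normal approximation to the t distribution, which is only reasonable when the number of paired samples is large.

```python
import math
from statistics import NormalDist, mean, stdev

def paired_test_bonferroni(x, y, n_comparisons, alpha=0.05):
    """Paired comparison of two estimators over the same simulated samples.

    Returns the paired t statistic, a Bonferroni-adjusted two-sided p-value
    and whether that adjusted p-value falls below alpha.
    Assumption: normal approximation to the t distribution (large samples).
    """
    diffs = [a - b for a, b in zip(x, y)]
    t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))
    p_raw = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    p_adj = min(1.0, p_raw * n_comparisons)  # Bonferroni adjustment
    return t_stat, p_adj, p_adj < alpha

# Hypothetical paired estimates from 100 simulated samples.
h_estimates = [0.2 + 0.01 * (i % 7) for i in range(100)]
sh_estimates = [h - 0.001 * (1 + i % 2) for i, h in enumerate(h_estimates)]
t_stat, p_adj, significant = paired_test_bonferroni(
    h_estimates, sh_estimates, n_comparisons=10)
```

Because the test is paired, a very small but consistent difference between two estimators can be statistically significant while remaining negligible in practical terms, as in the case reported above.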

4.2 Between-group Differential Recruitment 

Suppose infected nodes are preferentially recruited by all nodes. In this case infected nodes 
will be more likely to be sampled than if there were no differential recruitment, so the true 
sampling probability of infected nodes will be higher than that assumed by the estimators. 
Therefore, following the arguments of Section [2], we would expect all the estimates to increase. 

For our simulations, we parameterised differential recruitment between infection groups by 
the ratio of the probability that a node will recruit an infected node to the probability that it will recruit an uninfected node. 
Figure 3: Simulation results for varying levels of differential activity, with and without 
within-degree-group differential recruitment. 

Possible values are {2, 1, 0.5}. Results of the simulations for three levels of differential activity 
are shown in Figure [4]. This shows that for any fixed level of differential activity, as the 
probability that an infected node will be recruited increases relative to the probability that an uninfected 
node will be recruited, all of the estimates of Pa also increase. There is no evidence that any 
of the estimators considered corrects for differential recruitment in this case, although the SS- 
and VH-estimators tend to be slightly less biased than the SH- and H-estimators. As the level 
of differential recruitment between infection groups changes, the mean of each estimator 
changes by about the same amount as the sample proportion. 
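
To make the parameterisation concrete, the following Python sketch computes the probability that a recruiter's next coupon goes to an infected neighbour. It interprets the ratio as a per-neighbour weight applied to infected relative to uninfected neighbours, which is an assumption of this sketch, as are the function name and the neighbour counts.

```python
def prob_recruit_infected(n_infected, n_uninfected, ratio):
    """Probability that the next coupon goes to an infected neighbour.

    `ratio` is the weight of each infected neighbour relative to each
    uninfected neighbour (2, 1 or 0.5 in the simulations described above).
    """
    weighted_infected = ratio * n_infected
    total = weighted_infected + n_uninfected
    if total == 0:
        raise ValueError("recruiter has no eligible neighbours")
    return weighted_infected / total

# A hypothetical recruiter with 2 infected and 8 uninfected neighbours.
for ratio in (2, 1, 0.5):
    print(ratio, prob_recruit_infected(2, 8, ratio))
```

With a ratio of 1 the recruit is infected with probability 0.2, the share of infected neighbours; a ratio of 2 raises this to 1/3 and a ratio of 0.5 lowers it to 1/9.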

Now consider between-degree-group differential recruitment. If high-degree nodes are pref- 
erentially recruited by all nodes then this is likely to influence the estimates of Pa only if the 
occurrence of infection is greater or less in nodes of high-degree than in the general popu- 
lation. That is, we would expect between-degree-group differential recruitment to affect the 
estimates of Pa only if the degree distribution of infected nodes is different from the degree 
distribution of uninfected nodes, for example if there exists differential activity. In this case 
between-degree-group differential recruitment will induce between-infection-group differential 
recruitment, and we would expect the corresponding bias of the estimators. 

For these simulations, we used three levels of differential recruitment: none, low-degree 
nodes are preferentially recruited ("low"), and high-degree nodes are preferentially recruited 
("high"). For the latter two cases, the relative probabilities that a node of any degree will 
recruit a node of degree d are illustrated in Figure [5]. Results of the simulations for the 
three levels of between-degree-group differential recruitment and three levels of differential 
activity are shown in Figure [6]. From this figure we can see that when low-degree nodes are 
preferentially recruited this results in an increase in the mean of the estimators if differential 
activity is equal to 0.5 and a decrease in the mean of the estimators if differential activity is equal 
to 1.8. This is as we would expect based on the induced differential recruitment between 
Figure 4: Simulation results for varying levels of between-infection-group differential 
recruitment and differential activity. 

infection groups. There is no evidence that any of the estimators corrects for this type of 
differential recruitment. 

5 Recruitment Effectiveness and Non-response 

In this section we show that all estimators behave in a similar and predictable way in the 
presence of non-response. We also explore the behaviour of the estimators when there is 
imperfect recruitment effectiveness, and find that the SH- and H-estimators behave differently 
from the other estimators considered. 

We define recruitment effectiveness and non-response as follows: 

Definition 2 (Recruitment effectiveness) Denote by Rg the probability that a coupon 
given to an individual in group g is passed on to another candidate individual in the 
population, given that such an individual exists. Then Rg is the "recruitment effectiveness" of 
group g. 

Definition 3 (Non-response) Denote by Vg the probability that an individual in group g 
reports to the study centre after having been given a coupon. Then if Vg < 1 we say there 
exists non-response, and refer to Vg as the "response rate" of group g. 

Recruitment effectiveness and response rate are closely linked. If a coupon given to an 
individual is not returned, this could be either because that individual does not use the coupon 
to recruit anybody (imperfect recruitment effectiveness, Rg < 1), or because the person they 
recruit does not report back to the study centre (non-response, Vg < 1). It is not always 
possible to tell reliably for which reason a coupon was not returned, nor to directly estimate 
recruitment effectiveness and non-response rates. 
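
The difficulty of separating the two effects from coupon returns alone can be seen in a small Python sketch. It assumes, for illustration only, that passing on a coupon and responding to it are independent events, so that the probability a coupon is returned is the product of the recruiter's Rg and the recruit's Vg; the function name is hypothetical.

```python
def coupon_return_probability(r_recruiter, v_recruit):
    """P(coupon returned | an eligible candidate exists).

    r_recruiter: recruitment effectiveness Rg of the recruiter's group.
    v_recruit:   response rate Vg of the recruit's group.
    Assumes passing on a coupon and responding to it are independent.
    """
    return r_recruiter * v_recruit

low_effectiveness = coupon_return_probability(0.6, 0.9)
low_response = coupon_return_probability(0.9, 0.6)
```

Both scenarios give the same return probability (0.54), so the observed rate of coupon return does not by itself reveal whether recruitment effectiveness or response rate is responsible.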



Figure 5: Relative probabilities of recruitment for nodes, for different levels of differential 
recruitment between degree-group. The dashed line corresponds to preferential recruitment 
of low-degree nodes, and the solid line to preferential recruitment of high degree nodes. As 
with all simulations, the mean degree of all nodes is equal to 7. 

In this section we consider how imperfect recruitment effectiveness and non-response are 
likely to affect the estimators, and carry out a simulation study to test these arguments. 

5.1 Changes in the Selection Probabilities 

Suppose that node j ∈ g is given a coupon and there is probability Rg < 1 that he uses it to 
recruit another participant. If j does not pass on the coupon, then the children of j miss out 
on a chance to be recruited. Therefore, if Rg < 1 the children of nodes in group g will have a 
lower probability of selection than if Rg were equal to 1. Hence in general one would expect 
that if Rg < Rg', nodes who have many parents in group g will have a lower probability of 
selection than nodes with many parents in group g', everything else being equal. If this effect 
is not balanced between infection groups, and in the presence of homophily, it is likely to 
result in biased estimates of Pa. Because recruitment effectiveness does not affect the choice 
of recruit given that a coupon is passed on, changes in recruitment effectiveness would not be 
expected to affect the cross-group recruitment probabilities used in the SH- and H-estimators. 

5.2 Additional Effect of Non-response 

Rather than changing coupon-passing probabilities, non-response affects which coupons are 
returned. Hence, rather than reflecting the population proportion of infected individuals, the estimate of Pa 
will "estimate" the proportion of infected individuals among the population of respondents. 
Suppose we denote the set of respondents in the population by R. Then by Definition [3], 
Figure 6: Simulation results for varying levels of between-degree-group differential recruitment 
and differential activity. The labels indicate whether low ("low") or high ("high") degree nodes 
are preferentially recruited. 

$V_g = N_{g \cap R}/N_g$, and so the population proportion of infected individuals can be written as 

\[
\frac{N_A}{N} = \frac{N_{A \cap R}}{N_{A \cap R} + \frac{V_A}{V_B} N_{B \cap R}}
\]

(derivation given in Appendix [B]). If $V_A < V_B$, $N_A/N$ will be greater than $N_{A \cap R}/N_R$. Thus 
if there are no other sources of bias, all the estimators will be biased downward. If $V_A > V_B$, 
all estimators will be biased upward. 
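
The direction of this bias can be checked numerically. The Python sketch below compares the population proportion of infected individuals with the proportion among respondents, using Definition 3 in the sense that a fraction Vg of group g responds. The population sizes and function name are hypothetical; the response rates are those used in the simulations of Section 5.3.2.

```python
def respondent_proportion(n_a, n_b, v_a, v_b):
    """Proportion of infected individuals among respondents.

    n_a, n_b: population sizes of the infected and uninfected groups.
    v_a, v_b: response rates, so v_g * n_g individuals in group g respond.
    """
    resp_a = v_a * n_a
    resp_b = v_b * n_b
    return resp_a / (resp_a + resp_b)

n_a, n_b = 200, 800  # hypothetical population with Na/N = 0.2
true_proportion = n_a / (n_a + n_b)
lower_infected_response = respondent_proportion(n_a, n_b, 0.6, 0.9)
higher_infected_response = respondent_proportion(n_a, n_b, 0.9, 0.6)
```

In this example the respondent proportion is about 0.14 when infected individuals respond less often and about 0.27 when they respond more often, compared with a population proportion of 0.2.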

Similarly, non-response will affect the estimated cross-group recruitment probabilities. 
This is because the estimates are based only on the recruitments to nodes who actually re- 
spond, which is a subset of all recruitments actually made. For example, if infected nodes have 
a lower response-rate than uninfected nodes, then the estimated probability of an uninfected 
node recruiting an infected node will be reduced compared to the case of perfect response. 
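
A short Python sketch illustrates this effect on an estimated cross-group recruitment probability. It assumes, as a simplification made here and not stated in the text, that each recruit responds independently with the response rate of their own group; the function name and the value 0.2 are hypothetical.

```python
def observed_cross_recruitment(p_infected, v_a, v_b):
    """Share of an uninfected recruiter's responding recruits who are infected.

    p_infected: probability that a coupon is handed to an infected node.
    v_a, v_b:   response rates of infected and uninfected recruits.
    Assumes each recruit responds independently of who recruited them.
    """
    responding_infected = p_infected * v_a
    responding_uninfected = (1 - p_infected) * v_b
    return responding_infected / (responding_infected + responding_uninfected)

perfect = observed_cross_recruitment(0.2, 1.0, 1.0)
reduced = observed_cross_recruitment(0.2, 0.6, 0.9)
```

With perfect response the observed share equals the coupon-passing probability of 0.2; with response rates of 0.6 for infected and 0.9 for uninfected recruits it falls to about 0.14.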

5.3 Results 

5.3.1 Recruitment Effectiveness 

For our simulations, recruitment effectiveness by infection group is parameterised as a vector 
(Rb, Ra), where Rb and Ra are the probabilities that a coupon is passed on given that a 
candidate for recruitment exists, for uninfected and infected nodes respectively. We consider 
values {(0.6, 0.9), (1, 1), (0.9, 0.6)}. 

Recruitment effectiveness by degree group is parameterised in the form (α, β), where nodes 
of degree 1-5 have relative probability α of recruiting, nodes of degree greater than 10 have 
relative probability β of recruiting, and the relative probability for nodes of degree 6-10 
changes linearly from α to β. Possible values are {(0.5, 1), (1, 1), (1, 0.5)}. 
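
The degree-group parameterisation can be written as a small piecewise function. In the Python sketch below, the interpolation for degrees 6-10 runs between degree 5 (value α) and degree 11 (value β); these exact endpoints are an assumption of the sketch, since the text only states that the change is linear, and the function name is hypothetical.

```python
def relative_recruitment_effectiveness(degree, alpha, beta):
    """Relative probability of recruiting for a node of the given degree.

    Degrees 1-5 get alpha and degrees above 10 get beta. For degrees 6-10
    the value is interpolated linearly between degree 5 (alpha) and
    degree 11 (beta); the endpoints are an assumption of this sketch.
    """
    if degree <= 5:
        return alpha
    if degree > 10:
        return beta
    return alpha + (beta - alpha) * (degree - 5) / 6

# Profile for the parameterisation (0.5, 1) over degrees 1 to 12.
profile = [relative_recruitment_effectiveness(d, 0.5, 1.0) for d in range(1, 13)]
```

For the parameterisation (0.5, 1) the profile rises from 0.5 for low-degree nodes to 1 for high-degree nodes; swapping the arguments gives the (1, 0.5) case, in which low-degree nodes recruit more effectively.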
Results of the simulations for the different levels of recruitment effectiveness by infection 
group are shown in Figure [7]. Figure [7] shows that, for all levels of differential activity, as the 




Figure 7: Simulation results for varying levels of recruitment effectiveness by infection group 
and differential activity. 

recruitment effectiveness of uninfected nodes increases relative to the recruitment effectiveness 
of infected nodes, the sample proportion of infected nodes decreases. This is as expected 
due to the population homophily. The VH-estimates and SS-estimates also decrease as the 
sample proportion decreases, demonstrating that they are biased by different levels of recruitment 
effectiveness. However, it is interesting to note that neither the SH- nor the H-estimates 
appear to be influenced to the same extent by recruitment effectiveness as the VH-, SS- or 
Naive estimates. If there is no differential activity then the SH- and H-estimators appear 
to effectively control for different recruitment effectiveness within infection groups. In the 
presence of differential activity there is a slight bias in the opposite direction to the Naive 
estimator, but this tends to be small relative to the bias of the other estimators. We also 
ran the simulation for a sample size of 500. In this case, shown in Figure [8], the H- and 
SH-estimators have a substantial bias in the opposite direction to the Naive and VH-estimators. 
Thus the simulation results suggest that the SH- and H-estimates correct for bias due to 
differential recruitment effectiveness only when the sample fraction is small. Again, there is no significant difference 
between the means of the SH- and H-estimators. 

In the case of different levels of recruitment effectiveness by degree-group, shown in Figure 
[9], a similar effect can be seen. In this case, if there is no differential activity then different 
levels of recruitment effectiveness by degree group do not bias the estimators of Pa, 
because the induced recruitment effectiveness by infection-group is the same for infected and 
uninfected groups. For differential activity greater than 1, if low-degree nodes have higher 
recruitment rates than high-degree nodes (parameterisation (1, 0.5)), this means that on 
average uninfected nodes will have higher recruitment rates than infected nodes. Accordingly 
the sample proportion of infected individuals decreases as recruitment effectiveness by degree- 



[Figure 8 plot: estimates from the N, SS, VH, SH and H estimators, grouped by differential activity (DA = 0.5, 1, 1.8) and recruitment effectiveness by infection group (REinf = (0.6,0.9), (1,1), (0.9,0.6)).] 



Figure 8: Simulation results for varying levels of recruitment effectiveness by infection group 
and differential activity, for samples of size 500 rather than 200. 

group changes from (0.5, 1) to (1,0.5) for differential activity equal to 1.8, and the reverse is 
true if differential activity is equal to 0.5. Therefore, the patterns of bias are similar to the 
case of recruitment effectiveness by infection group. The bias of the H- and SH- estimators 
increases accordingly for samples of size 500 (not shown). 

The difference in mean between the SH- and H-estimators is statistically significant when 
recruitment effectiveness by degree-group is equal to (0.5, 1) and differential activity is 0.5 or 
1.8. In both of these cases the mean of the H-estimates was closer to the true value than the 
mean of the SH-estimates. However, from the simulation results of Figure [9], this difference 
may not be large enough to be of practical significance. 

5.3.2 Non-response 

Non-response by infection group is parameterised as a vector (Vb, Va), where Vb and Va are 
the probability that a node responds given that they have been recruited, for uninfected and 
infected nodes respectively. Possible values are {(0.6,0.9), (1, 1), (0.9,0.6)}. Non-response by 
degree group is parameterised in the same way as recruitment effectiveness by degree group. 
Possible values are {(0.5, 1), (1, 1) , (1, 0.5)}. 

Results of the simulations for the different levels of non-response by infection group and 
degree group are shown in Figures [10] and [11] respectively. From Figure [10], as the response rate 
decreases for infected nodes relative to uninfected nodes, i.e. the parameterisation changes 
from (0.6, 0.9) to (0.9, 0.6), the sample proportion of infected nodes decreases, as expected. 
Again, all estimators are biased by changes in the relative rates of non-response. As the 
sample proportion of infected individuals decreases, so too do the Naive, SS-, VH-, SH- and 
H-estimates. There are no significant differences between the mean SH- and H-estimates. 

Figure 9: Simulation results for varying levels of recruitment effectiveness by degree group 
and differential activity. 

For the case of non-response by degree-group, similarly to the case of recruitment 
effectiveness by degree-group, the results can be interpreted in terms of induced non-response 
by infection-group. Viewed in this way, the results shown in Figure [11] are consistent with 
those of Figure [10]. There are significant differences between the SH- and H-estimates when 
there exists both non-response by degree group and differential activity, but again these 
differences do not appear to be large. None of the estimators seems to control for different rates 
of non-response between infected and uninfected groups. 

6 Comparison of the Salganik-Heckathorn and Heckathorn Estimators 

In Section [2.5] we hypothesised that the SH- and H-estimates would differ if the Markov chain 
on degree groups was started out of equilibrium. Further, the more slowly the chain reaches 
equilibrium, the larger the expected difference between the SH- and H-estimates for a fixed 
sample size. 

In the simulations presented in Sections [4] and [5] there were very few statistically significant 
differences between the mean SH- and H-estimates. Those differences which were significant 
always occurred when there was differential activity and differences in recruitment or response 
behaviour between degree groups. These are the cases where the equilibrium probabilities of 
the Markov chain on degree groups are likely to be furthest from the degree distribution of the 
seeds. However, in all simulations the differences between the mean (and median) estimates 
of the H- and SH- estimators were unlikely to be large enough to be of practical concern (all 
mean differences were less than 0.001), and the variances of the estimates were also nearly 
identical. 
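
The role of the starting distribution can be illustrated with a toy Markov chain on two degree groups. The Python sketch below is not the chain used in the simulations: the transition matrix, the seed distribution and the function names are hypothetical, chosen only to show how the distance from equilibrium shrinks wave by wave when all seeds come from the low-degree group.

```python
def step(dist, transition):
    """One step of a Markov chain: row vector times transition matrix."""
    n = len(dist)
    return [sum(dist[i] * transition[i][j] for i in range(n)) for j in range(n)]

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical two-state chain on degree groups (low, high).
transition = [[0.8, 0.2],
              [0.3, 0.7]]
equilibrium = [0.6, 0.4]       # satisfies pi = pi * transition for this matrix
seeds_low_degree = [1.0, 0.0]  # all seeds drawn from the low-degree group

dist = seeds_low_degree
distances = []
for wave in range(6):
    distances.append(total_variation(dist, equilibrium))
    dist = step(dist, transition)
```

In this example the distance from equilibrium starts at 0.4 and halves with each wave. Larger diagonal entries, corresponding to stronger homophily between degree groups, would slow this convergence.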



Heckathorn (2007, p. 185) admits that traditional RDS designs are likely to result in 
only "modest" differences between the two estimators. However, he theorises that 
Figure 10: Simulation results for varying levels of non-response by infection group and 
differential activity. 

". . . changes in research design that induce an association between degree and 
opportunities to recruit would produce much larger potential effects." 

A design that gave respondents with higher reported degree more coupons than respondents 
with low reported degree would have such an effect. Such a design can be viewed as equivalent 
to one where all nodes are given the same number of coupons but nodes of lower degree have 
much lower recruitment effectiveness. This case was considered in Section [5], Figure [9], and 
again produced only very small differences between the H- and SH-estimates. Similarly, a 
design where infected nodes are given more coupons than uninfected nodes is equivalent to the 
case we considered in Figure [7], where uninfected nodes have lower recruitment effectiveness 
than infected nodes. Again, only very small differences between the H- and SH-estimates 
were observed. We also re-ran the simulations using four coupons rather than two, but this 
did not produce any significant changes in the results. 

To try to induce larger differences between the estimates we looked at the effect of 
increasing homophily (to slow convergence), preferentially selecting seeds from nodes of low or 
high degree (to start the chain out of equilibrium), and more extreme differences in recruitment 
effectiveness. The largest differences between the SH- and H-estimates arose when the seeds 
were selected from nodes of low degree. Compared to the effect of seed selection, increasing 
differential activity or homophily had little effect on the difference between the estimates. 
Simulation results for the case that ten seeds are chosen uniformly at random from the twenty 
nodes of lowest degree are shown in Figure [12]. The observable differences between the H- and 
SH-estimates when seeds are selected from low-degree nodes are statistically significant. It is 
also very interesting to note that in these cases the H-estimator appears to correct somewhat 
for the bias introduced by the seed selection mechanism. 

To investigate further, we considered the differences in absolute deviations from 0.2 of 
the H- and SH- estimates. Figure [13] shows that on average the H- estimate is closer to the 
Figure 11: Simulation results for varying levels of non-response by degree group and 
differential activity. 

true value of 0.2 than the SH- estimate when seeds are chosen from nodes of either high 
or low degree. This observation is robust to changes in sample-size. The simulation results 
therefore suggest that the H- estimator corrects for bias due to selecting seeds with degrees 
not representative of the theoretical equilibrium distribution on degree groups. If there are 
no other sources of bias, the H- estimator is likely to be less biased than the SH- estimator 
in this case. However, it should be noted that the seed selection scenarios which produced 
the observed differences in behaviour were quite extreme, and are unlikely to occur unless by 
design. 
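
The comparison of absolute deviations described above can be sketched in a few lines of Python. The estimates below are hypothetical, and the sign convention (SH deviation minus H deviation) is an assumption of this sketch, not necessarily the one used in Figure [13].

```python
def deviation_difference(sh_estimates, h_estimates, true_value=0.2):
    """|SH - true| - |H - true| for each simulated sample.

    Positive values mean the H-estimate is closer to the true value.
    The sign convention is an assumption of this sketch.
    """
    return [abs(sh - true_value) - abs(h - true_value)
            for sh, h in zip(sh_estimates, h_estimates)]

# Hypothetical estimates from three simulated samples.
diffs = deviation_difference([0.26, 0.15, 0.21], [0.23, 0.17, 0.22])
```

Here the H-estimate is closer to 0.2 in the first two samples (positive differences) and further away in the third (negative difference); a box-plot of such differences over many samples shows which estimator is closer on average.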

We also considered how the SH- and H-estimators behave when seeds are selected from 
nodes of low degree and, in addition, there exists differential recruitment, unbalanced 
recruitment effectiveness or non-response. In this case the biases due to seed selection and 
recruitment effectiveness or non-response behaved in an approximately additive fashion. Hence, 
depending on the components of the additive bias, either the SH- or H-estimates could have 
the lower total bias. 

In practice, a scenario as extreme as that illustrated in Figure [12] is unlikely to arise. 
More realistically, it could happen that there exists differential activity and that seeds are 
selected only from infected individuals. If infected nodes have a lower or higher average 
degree than uninfected nodes, then this will induce the same effect as considered above. We 
ran the simulations for this case and, although the difference between the mean of the H- 
and SH- estimators was statistically significant for the cases with differential activity and 
seeds chosen only from infected nodes (and only those cases), the size of the differences 
(approximately 0.001) is unlikely to be of practical concern. This indicates that it is unlikely 
that these conditions will result in large differences between H- and SH- estimates in traditional 
respondent driven sampling designs. 
Figure 12: Box-plots of the estimates for different methods of selecting seeds. Seeds are either 
selected with probability proportional to degree or uniformly at random from the twenty 
nodes of lowest degree. 

7 Discussion 

In this paper we compare the performance of several respondent-driven sampling estimators 
under conditions of differential recruitment, recruitment effectiveness and non-response. Our 
specific findings are summarized in Table [1]. Of particular importance is the performance of the 
estimator introduced in Heckathorn (2007), which claims to adjust for differential recruitment 
and recruitment effectiveness. 

We did not find any evidence that the H-estimator adjusts for differential recruitment 
or non-response. However, in the case of recruitment effectiveness, the resulting bias of the 
H-estimator and the SH- estimator is in the opposite direction to that of the Naive, SS- and 
VH- estimators, and the size of this bias depends on the sampling fraction (see Figures [7] 
and [8]). For the simulations with sample size 200, this resulted in the SH- and H- estimators 
effectively controlling for the bias due to imperfect recruitment effectiveness. 

When seeds were randomly selected with probability proportional to degree, under all 
conditions of differential recruitment, recruitment effectiveness and non-response the estimates 
of the H-estimator were almost identical to those of the SH-estimator. In Sections [4] 
(differential recruitment) and [5] (recruitment effectiveness and non-response), we found statistically 
significant differences between these two estimators in only 10 of 69 simulation cases (sample 
size 200). The magnitudes of the differences in these cases were small enough to be negligible 
for practical purposes. 

This led us to explore, in Section [6], other conditions which might lead to differences 
between the H- and SH-estimates. Here, we found that we can induce differences between 
the two estimators when the following two conditions both hold: 

1. The seeds are selected with probability far from the theoretical equilibrium probabilities 
of the Markov chain on degree groups; and 

2. The degree distribution of infected and uninfected nodes is not the same (for example 
if there is differential activity). 

Figure 13: Box-plots of the difference in absolute deviations from 0.2 of the estimates returned 
by the SH- and H-estimators when seeds are chosen either from among the nodes of lowest 
or highest degree. 

Under these conditions the H-estimator was significantly less biased than the SH-estimator, 
indicating that the H-estimator may correct for the bias introduced by this type of seed 
selection. Specifically, we found the largest differences between the estimates of the H- and 
SH- estimators when seeds were selected from among the nodes of lowest degree. Smaller, 
but substantial differences were also apparent when selecting seeds from among nodes of the 
highest degree. It is possible this scenario may arise by design, where study designers try to 
recruit well-connected population members as seeds. 

If there are no sources of bias other than unbalanced selection of seeds by degree, then 
the H-estimator is likely to have a smaller bias than the other estimators considered in this 
paper. However, if other sources of bias are present (which will almost always be the case), 
then these will combine with the seed-selection bias, so the absolute bias of the H-estimator may not be the 
smallest of the estimators considered. 

All estimators were subject to bias induced by differential recruitment, recruitment 
effectiveness and non-response. It is interesting, however, that when the differential recruitment 
and recruitment effectiveness acted only on degree groups, bias was only present in the 
estimators when the infection groups had different degree distributions (differential activity ≠ 1). 
Without this condition, over- or under-representing degree groups and their sampling patterns 
in the estimators does not differentially affect the two groups. Across these simulations, for 
random selection of seeds the H- and SH-estimators tended to perform similarly in each 
case. The Naive estimator (sample mean) tended to exhibit the largest bias, whereas the SS-, 
VH-, SH- and H-estimators showed reduced bias due to adjusting for differential activity 
(Gile and Handcock, 2010). For all cases the SS-estimates fell between the Naive and 
VH-estimates. In general, apart from the conditions of seed selection mentioned above, none 
of the SS-, VH-, SH- or H-estimators was consistently less biased than any other. They 
also had similar variances, although the variance of the VH- and SS-estimators appeared to be 
slightly smaller in many cases, and the estimates returned by the SS- and VH-estimators were 
not subject to the same instability as the SH- and H-estimates. 

Overall, our findings indicate that no estimator consistently out-performs the others under 
all conditions. We find that with small sample fractions, the SH- and H-estimators correct for 
differential recruitment effectiveness across groups, which induces bias in the other estimators 
in the presence of homophily. This is an advantage, and therefore these estimators should 
be used whenever the sample fraction is small and differential recruitment effectiveness is 
suspected. However, these estimators are also subject to more instability, and have 
larger variance and greater bias than the VH- or SS-estimators under many other sampling 
conditions, both in this paper and as described in Gile and Handcock (2010). This suggests 
that, absent differential recruitment effectiveness, the VH- or SS-estimators are to be preferred. 
In the case of large sample fractions, the SS-estimator is clearly to be preferred (Gile, 2010). 
Our results do not provide evidence that the H-estimator should be preferred over the 
SH-estimator. Although we found differences between the H- and SH-estimators in some extreme 
cases, we suspect these cases are rare enough not to justify the additional complexity of 
the H-estimator. The lack of clear preference among these estimators highlights the need 
both for new estimators and for new diagnostic tests aimed at identifying circumstances in 
which each estimator is to be preferred. 

It is also important to keep in mind the limitations of this study. It was not possible to 
consider all possible combinations and levels of the parameterisations of the effects studied 
in this paper. We have tried to focus on a range of simulation conditions and parameter 
values most likely to highlight contributions of the H-estimator, and considered variations of 
many parameters at once to draw out potential interaction effects. Nevertheless, there may 
well be other population or sampling parameters whose levels will impact the results of our 
study. While we have tried to include any confounders we felt might affect our qualitative find- 
ings, any effect sizes should be understood as specific to the simulation conditions under study. 



A Derivation of Hansen-Hurwitz Estimators 

Consider the Hansen-Hurwitz estimator of Ha, 

Ha = ^^^^VTi 

= riA- 

Thus, taking the ratio of Hansen-Hurwitz estimators for H^ and n^-l-Hs shows that ua/ {nA + 
ns) is a generalised Hansen-Hurwitz estimator of nA/(n^ -|- H^). 
B Non-response 

By Definition [3], $V_g = N_{g \cap R}/N_g$, and so the population proportion of infected individuals can 
be written as 

\begin{align*}
\frac{N_A}{N} &= \frac{V_B V_A N_A}{V_B V_A N_A + V_A V_B N_B} \\
&= \frac{V_B N_{A \cap R}}{V_B N_{A \cap R} + V_A N_{B \cap R}} \\
&= \frac{N_{A \cap R}}{N_{A \cap R} + \frac{V_A}{V_B} N_{B \cap R}},
\end{align*}

as claimed. 
Table 1: Summary of conditions under which tested respondent behaviours induce bias 

                                                                          Estimator
Condition                   Description                                   N    SS   VH   SH   H
Differential Recruitment    Non-random coupon-passing among alters
  Within-Group              Different rates of recruitment to own group
    Infection               To own infection group                        ✓    ✓    ✓    ✓    ✓
    Degree                  To own degree group                           †
  Between-Group             One group preferentially recruited
    Infection               Infection group preferred                          ✓    ✓    ✓    ✓
    Degree                  Degree group preferred                        †    †    †    †    †
Recruitment Effectiveness   Differential rates of coupon-passing
  Infection Group           Infection group passes more coupons                          ✓*   ✓*
  Degree Group              Degree group passes more coupons              †    †    †    †*   †*
Non-Response                Differential rates of coupon return
  Infection Group           Infection group returns more coupons                         ✓    ✓
  Degree Group              Degree group returns more coupons             †    †    †    †    †
Low-degree Seeds            Seeds selected from low degree nodes          ✓    ✓    ✓    †

Key 

✓ Estimator shows bias regardless of differential activity 
† Bias only in the presence of differential activity 
* Bias only with large sample fraction 



